Abstract: In this paper we study the multisource multicast
problem where every sink in a given directed acyclic graph is 
a client and is interested in a common file. We consider the 
case where each node can have partial knowledge about the file 
as a side information. Assuming that nodes can communicate 
over the capacity constrained links of the graph, the goal is 
for each client to gain access to the file, while minimizing some 
linear cost function of the number of bits transmitted in the network.
We consider three types of side-information settings: (i) side
information in the form of uncoded packets of the file; (ii) side
information in the form of linearly correlated packets; and (iii)
the general setting where the side information at the nodes has
an arbitrary (i.i.d.) correlation structure. In this work we 1)
provide a polynomial-time feasibility test, i.e., a test of whether
or not all the clients can recover the file, and 2) provide a
polynomial-time algorithm that finds the optimal rate allocation
among the links of the graph, and then determines an explicit
transmission scheme for cases (i) and (ii).

I. Introduction 

We consider a multi-source multicast problem, over a given 
network topology with capacity constrained links. There are 
two types of nodes in the network: clients that are interested
in recovering the whole content, and source nodes that may
possess possibly correlated side-information. To further illustrate
the problem setup, consider the following example.

A file consists of four equally sized packets $a$, $b$, $c$ and $d$
belonging to some finite field $\mathbb{F}_{q^n}$. Also, suppose that the data
packets are distributed across the nodes $m_1$ through $m_4$, which
are connected as shown in Figure 1. The clients, denoted by $t_1$
and $t_2$, are interested in recovering the entire file. The edges in
the graph are denoted by $e_1, \ldots, e_7$ as shown in Figure 1. The
objective is to minimize some function of the communication
cost such that the clients $t_1$ and $t_2$ can recover the entire file.
For instance, it can be shown that the following coding scheme
minimizes the total number of bits communicated: node $m_1$
transmits $a$ on link $e_2$, node $m_2$ transmits $b, c$ on link $e_3$,
node $m_3$ transmits $c$ on link $e_5$, and node $m_4$ transmits $a, b, d$ on
link $e_6$ and $a, b, c, d$ on link $e_7$.

Fig. 1. An example of the multisource multicast problem, where nodes
$m_1, \ldots, m_4$ observe the subsets of the file packets $\{a, b, c, d\}$ as shown
above. Assuming that nodes can communicate reliably over the capacity
constrained links, the goal is for the clients $t_1$ and $t_2$ (sinks of the graph) to
gain access to the entire file while minimizing the communication cost.

Note that the example above considers a simple form of
the side-information, where different nodes observe partial
uncoded or "raw" data packets of the original file. Another
important special case of side-information is when nodes
observe linear combinations of the data packets of the original
file. In a more general setting the side-information can be of
a more complex form, i.e., have arbitrary correlations.

The multisource multicast problem was originally studied
by Ho et al. [1], where for the linearly coded packets the
authors showed under what conditions it is possible to recover
the file at all destinations. For the case of uncoded packets, it
is easy to show that one can add a super source as in [2] to
the network and then, using results from [3], find an optimal
solution that minimizes the communication cost. In [4], [5] the
authors considered a related problem over an undirected graph
where all the nodes are interested in recovering the complete
file. In [4] it was shown that the problem is NP-hard, while
an approximate solution is provided in [5]. In [6], Haeupler et
al. analyzed gossip-based protocols in networks where each
node observes correlated data.

In this paper, we make the following contributions. 

• In the most general scenario of arbitrarily correlated
side information, we provide conditions as well as
a polynomial time algorithm to determine when a
multisource multicast problem over a given directed acyclic
graph (DAG) is feasible.

• Using submodular flow techniques, we provide a deterministic
polynomial time algorithm to find the number of bits
each node should transmit in order to recover the file at all
the clients while being optimal w.r.t. any linear cost function
(see footnote 1).

• For the special case of linearly correlated side information,
we provide an optimal communication scheme based on
the algebraic network coding framework.

II. System Model and Preliminaries 

In this work we represent the network by a directed acyclic
graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set
of links that have capacity constraints. We define the capacity
function $c : \mathcal{E} \to \mathbb{R}$ to denote the maximum number of bits that
can be transmitted over a given link. We distinguish between
two types of nodes: 1) source nodes $\mathcal{M} = \{m_1, m_2, \ldots, m_l\}$
that have partial information about the file, and 2) clients
$\mathcal{T} = \{t_1, t_2, \ldots, t_k\}$, which are interested in recovering the
file and are sinks in the graph $G$. Let $X_{m_1}, X_{m_2}, \ldots, X_{m_l}$
denote the components of a discrete memoryless multiple
source (DMMS) with a given joint probability mass function.
Each source node $m_i \in \mathcal{M}$ observes $n$ i.i.d. realizations of
the corresponding random variable $X_{m_i}$, denoted by $X_{m_i}^n$.
We note that the results of this paper can be applied in a
straightforward manner when the clients have side information
as well. For the sake of brevity, we focus on the case when clients
have no side information.

The goal is for each client in $\mathcal{T}$ to gain access to all source
nodes' observations, i.e., to download the file. In order to
achieve this goal, each node $m_i \in \mathcal{M}$ is allowed to send
information across the graph $G$ at a rate which is limited by the
capacity of the outgoing links of that node. The transmission of
each source node is a function of its own initial observation
and all the information it receives from its neighbors. Let us
denote the transmission on the link $e = (m_i, m_j) \in \mathcal{E}$ by

$$F_e = f_e\left(X_{m_i}^n, \{F_a : \forall m_r \text{ s.t. } a = (m_r, m_i) \in \mathcal{E}\}\right), \tag{1}$$

where $f_e(\cdot)$ is a mapping of the observations $X_{m_i}^n$ and the
transmissions received from the neighbors of $m_i$,
$\{m_r : (m_r, m_i) \in \mathcal{E}\}$, to an outgoing message on the link $e$.

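We denote by $\mathcal{M}_{t_i} \subseteq \mathcal{M}$ the set of source nodes which are
connected to the client $t_i \in \mathcal{T}$. In other words, there exists a
path in graph $G$ from every node in $\mathcal{M}_{t_i}$ to the client $t_i \in \mathcal{T}$.
Consequently, we define the graph $G_{t_i} = (\mathcal{V}_{t_i}, \mathcal{E}_{t_i})$ to be a
subgraph of $G$, where $\mathcal{V}_{t_i} = \{\mathcal{M}_{t_i}, t_i\}$, and $\mathcal{E}_{t_i} \subseteq \mathcal{E}$ is the set
of links that connect all nodes in $\mathcal{M}_{t_i}$ among themselves and
with client $t_i$. Furthermore, we assume that

$$H(X_{\mathcal{M}_{t_i}}) = H(X_{\mathcal{M}}), \quad \forall t_i \in \mathcal{T}, \tag{2}$$

where $X_{\mathcal{M}_{t_i}} = (X_{m_j} : m_j \in \mathcal{M}_{t_i})$ and $X_{\mathcal{M}} = (X_{m_1}, \ldots, X_{m_l})$.
Equality (2) ensures that every client in the
network can potentially gain access to the entire process $X_{\mathcal{M}}$.
For each client $t_i \in \mathcal{T}$ to learn the file, the transmissions $F_e$,
$\forall e \in \mathcal{E}$, must satisfy

$$\lim_{n \to \infty} \frac{1}{n} H\left(X_{\mathcal{M}}^n \,\middle|\, \{F_e\}_{e = (m_j, t_i) \in \mathcal{E}}\right) = 0, \quad \forall t_i \in \mathcal{T}. \tag{3}$$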

Footnote 1: The linear cost function is defined w.r.t. the number of bits
transmitted on each link.



Definition 1. A rate tuple $R = (R_e : e \in \mathcal{E})$ is an
achievable multisource multicast (MM) rate vector if there
exists a communication scheme with transmitted messages
$F = (F_e : e \in \mathcal{E})$ that satisfies (3), and is such that

$$R_e = \lim_{n \to \infty} \frac{1}{n} H(F_e), \quad \forall e \in \mathcal{E}, \tag{4}$$

where $R_e \le c_e$, $\forall e \in \mathcal{E}$.



In this work, we design a polynomial time algorithm for the
multisource multicast problem that minimizes the linear cost
function $\sum_{e \in \mathcal{E}} \alpha_e R_e$, where $\alpha = (\alpha_e : e \in \mathcal{E})$, $0 \le \alpha_e < \infty$,
$\forall e \in \mathcal{E}$, is a vector of non-negative finite weights. We allow the
$\alpha_e$'s to be arbitrary non-negative constants, to account for the
case when communication across some group of links in $G$
is more expensive compared to the others. Thus, the problem
can be formulated as:

$$\min_{R} \sum_{e \in \mathcal{E}} \alpha_e R_e, \quad \text{s.t. } R \text{ is an achievable MM-rate vector.} \tag{5}$$



A. Finite Linear Source Model 

Now, we briefly describe a special case of a DMMS called
the finite linear source model [7]. Let $q$ be some power of a
prime. Consider the $N$-dimensional random vector $W \in \mathbb{F}_{q^n}^N$
whose components are independent and uniformly distributed
over the elements of $\mathbb{F}_{q^n}$. Then, in the linear source model,
the observation of each node $m_i \in \mathcal{M}$ is simply given by

$$X_{m_i} = A_{m_i} W, \quad m_i \in \mathcal{M}, \tag{6}$$

where $A_{m_i}$, a matrix over $\mathbb{F}_{q^n}$ with $N$ columns, is the observation
matrix of node $m_i$.
It is easy to verify that for the finite linear source model,

$$\frac{H(X_{m_i})}{\log q^n} = \operatorname{rank}(A_{m_i}). \tag{7}$$

For the finite linear source model, besides the optimal MM-rate
vector, we provide a polynomial time code construction
based on the algebraic network coding approach [8].
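As a rough illustration of (7), the sketch below computes normalized entropies as matrix ranks. It rests on several assumptions that are not part of the paper: the field is taken to be a prime field GF(p) rather than $\mathbb{F}_{q^n}$, the observation matrices are the raw-packet ones for a four-packet file, and the helper names `rank_mod_p` and `entropy_in_packets` are hypothetical. It also uses the standard fact that the joint entropy of a set of nodes in a linear model is the rank of their stacked observation matrices.

```python
def rank_mod_p(rows, p):
    """Rank of a matrix (list of integer rows) over the prime field GF(p)."""
    m = [[x % p for x in r] for r in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a pivot row at or below the current rank.
        pivot = next((r for r in range(rank, len(m)) if m[r][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, p)  # modular inverse, needs Python 3.8+
        m[rank] = [(x * inv) % p for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col]:
                f = m[r][col]
                m[r] = [(x - f * y) % p for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank


# Assumed observation matrices for W = [a b c d]^T (raw packets only).
A = {
    "m1": [[1, 0, 0, 0], [0, 1, 0, 0]],  # observes a, b
    "m2": [[0, 1, 0, 0], [0, 0, 1, 0]],  # observes b, c
    "m3": [[0, 0, 1, 0]],                # observes c
    "m4": [[0, 0, 0, 1]],                # observes d
}


def entropy_in_packets(nodes, p=2):
    """Joint entropy of the listed nodes, in units of one packet."""
    stacked = [row for v in nodes for row in A[v]]
    return rank_mod_p(stacked, p) if stacked else 0
```

For example, `entropy_in_packets(["m1", "m2"])` counts the three distinct packets $a, b, c$ seen by the two nodes together.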

III. Multisource Multicast Rate-Flow Region 

In order to solve the optimization problem in (5) we first
establish a region called a "rate-flow region" that contains
all possible optimal rate allocations. To identify this rate-flow
region for our example of Figure 1 in the case of arbitrarily
correlated side-information at the source nodes, we start by
considering a single client $t_1$. Next, we isolate the subgraph
$G_{t_1} = (\mathcal{V}_{t_1}, \mathcal{E}_{t_1})$ corresponding to $t_1$ and modify its link
capacities to infinity as shown in Figure 2.

Suppose the optimal solution w.r.t. problem (5) is achieved
by $R^* = (R_1^*, \ldots, R_6^*)$. Then, it follows that the transmissions of
node $m_2$ have to satisfy

$$\begin{aligned}
R_3^* &\ge H(X_{m_2} \mid X_{m_1}, X_{m_3}, X_{m_4}), \\
R_1^* + R_2^* + R_3^* &\ge H(X_{m_1}, X_{m_2} \mid X_{m_3}, X_{m_4}).
\end{aligned} \tag{8}$$



Fig. 2. Single client multisource multicast problem over graph
$G_{t_1} = (\mathcal{V}_{t_1}, \mathcal{E}_{t_1})$ derived from the graph $G(\mathcal{V}, \mathcal{E})$ of Figure 1 for the case of
arbitrarily correlated side-information at the source nodes.

Let us now consider node $m_4$. Its transmission includes
information received from nodes $m_1$ and $m_2$ combined with
its own side information. Since the goal is to minimize the
total communication cost, it follows that for the optimal MM-rate
vector $R^*$, the transmissions of nodes $m_1$ and $m_2$ cannot be
further compressed at node $m_4$. Therefore, the transmission
of node $m_4$ consists of two components: 1) routed information
from nodes $m_1$ and $m_2$, and 2) innovative side-information at
node $m_4$ w.r.t. all other source nodes in the network. Hence,
$R^*$ must satisfy

$$R_4^* + R_6^* - R_2^* - R_3^* \ge H(X_{m_4} \mid X_{m_1}, X_{m_2}, X_{m_3}). \tag{9}$$

In order for client $t_1$ to recover the file, i.e., to gain access
to $X_{\mathcal{M}_{t_1}}$, the incoming links to $t_1$ necessarily have to carry
the entire information about the process. In other words,

$$R_5^* + R_6^* = H(X_{\mathcal{M}_{t_1}}), \tag{10}$$

where the equality sign comes from the fact that the goal is
to minimize the overall communication cost, and thus it is
wasteful for client $t_1$ to receive at a rate larger than the joint
entropy of the process.

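Considering all possible subsets of the source node set $\mathcal{M}_{t_1}$,
we have that an optimal MM-rate vector $R^*$ must belong to
the following rate-flow region

$$\partial\mathcal{R}_{t_1} = \{\partial R : \partial R(S) \ge H(X_S \mid X_{\mathcal{M}_{t_1} \setminus S}),\ \forall S \subset \mathcal{M}_{t_1},\ \partial R(\mathcal{M}_{t_1}) = H(X_{\mathcal{M}_{t_1}})\}, \tag{11}$$

where

$$\partial R(S) = \sum_{e \in \Delta^+ S} R_e - \sum_{e \in \Delta^- S} R_e, \tag{12}$$

and $\Delta^+ S \subseteq \mathcal{E}_{t_1}$ ($\Delta^- S \subseteq \mathcal{E}_{t_1}$) denotes the set of links leaving
(entering) $S$. For instance, if $S = \{m_3, m_4\}$, then the optimal
rate vector $R^*$ satisfies

$$\partial R^*(S) = R_5^* + R_6^* - R_1^* - R_2^* - R_3^* \ge H(X_{m_3}, X_{m_4} \mid X_{m_1}, X_{m_2}). \tag{13}$$

It can be verified that any rate vector that belongs to the
rate-flow region $\partial\mathcal{R}_{t_1}$ can be achieved using the multi-terminal
Slepian-Wolf random-binning scheme [9]. Thus, the rate-flow
region $\partial\mathcal{R}_{t_1}$ contains all optimal MM-rate vectors w.r.t. the
optimization problem (5).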



Extension of this result to the multiple client case is straightforward:
an optimal MM-rate vector has to simultaneously
belong to all rate-flow regions $\partial\mathcal{R}_{t_i}$ which correspond to the
graphs $G_{t_i}$, $\forall t_i \in \mathcal{T}$. Hence, the optimization problem (5) can
be written as

$$\min_{R} \sum_{e \in \mathcal{E}} \alpha_e R_e, \tag{14}$$
$$\text{s.t. } \partial R \in \partial\mathcal{R}_{t_1} \cap \partial\mathcal{R}_{t_2} \cap \cdots \cap \partial\mathcal{R}_{t_k}, \quad R_e \le c_e,\ \forall e \in \mathcal{E}.$$

Before we address the question of efficiently solving the
problem (14), we first need to answer whether or not the
problem is feasible.

IV. Feasibility of the Multisource Multicast Problem

As in Section III, we first consider the single client case, i.e.,
when $\mathcal{T} = \{t_1\}$. The obtained result then naturally extends to
the setting with an arbitrary number of clients. Before we go any
further, let us introduce some concepts from
combinatorial optimization theory which will turn out to be useful in proving
our results. The set function $f : 2^{\mathcal{M}_{t_1}} \to \mathbb{R}$ is supermodular if

$$f(S) + f(T) \le f(S \cup T) + f(S \cap T), \quad \forall S, T \subseteq \mathcal{M}_{t_1}. \tag{15}$$

If the inequality sign in (15) is reversed, then the function $f$
is called submodular. Let us define the polyhedron $P(f)$ and
the base polyhedron $B(f)$ of a supermodular function $f$ as
follows:

$$P(f) = \{Z \mid Z \in \mathbb{R}^{|\mathcal{M}_{t_1}|},\ \forall S \subseteq \mathcal{M}_{t_1} : Z(S) \ge f(S)\}, \tag{16}$$
$$B(f) = \{Z \mid Z \in P(f),\ Z(\mathcal{M}_{t_1}) = f(\mathcal{M}_{t_1})\}, \tag{17}$$

where $Z(S) = \sum_{i \in S} Z_i$. Analogously, we define the polyhedron
and the base polyhedron of a submodular function (the
inequality signs in (15) and (16) are reversed).
It is easy to show that the function

$$g_{t_1}(S) = H(X_S \mid X_{\mathcal{M}_{t_1} \setminus S}), \quad \forall S \subseteq \mathcal{M}_{t_1}, \tag{18}$$

is supermodular. Hence, the rate-flow region $\partial\mathcal{R}_{t_1}$ defined
in (11) represents the base polyhedron of the function $g_{t_1}$.
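The supermodularity of the conditional-entropy function in (18) can be sanity-checked by brute force on a small instance. The sketch below is only an illustration: it assumes raw-packet side information, so that joint entropy measured in packets is the number of distinct packets a set of nodes holds, and the names `OBS`, `H`, `g`, and `is_supermodular` are hypothetical.

```python
from itertools import combinations

# Assumed raw-packet observations of four source nodes.
OBS = {"m1": {"a", "b"}, "m2": {"b", "c"}, "m3": {"c"}, "m4": {"d"}}


def H(nodes):
    """Joint entropy in packets: number of distinct packets held by `nodes`."""
    return len(set().union(*(OBS[v] for v in nodes)))


def g(S, M=frozenset(OBS)):
    """Conditional entropy H(X_S | X_{M \\ S}) = H(X_M) - H(X_{M \\ S})."""
    return H(M) - H(M - frozenset(S))


def subsets(M):
    """All subsets of M, including the empty set and M itself."""
    items = sorted(M)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_supermodular(fn, M):
    """Check fn(S) + fn(T) <= fn(S | T) + fn(S & T) for all pairs of subsets."""
    subs = list(subsets(M))
    return all(fn(S) + fn(T) <= fn(S | T) + fn(S & T)
               for S in subs for T in subs)
```

On this instance `is_supermodular(g, frozenset(OBS))` holds, while the joint entropy `H` itself fails the same test, as expected of a submodular function. The check enumerates all pairs of subsets, so it is only practical for very small node sets.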

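Lemma 1. For a single client multisource multicast problem
over $G_{t_1} = (\mathcal{V}_{t_1}, \mathcal{E}_{t_1})$, where $\mathcal{V}_{t_1} = \{\mathcal{M}_{t_1}, t_1\}$, there exists
an achievable MM-rate vector, i.e., $\partial\mathcal{R}_{t_1} \ne \emptyset$ and $R_e \le c_e$,
$\forall e \in \mathcal{E}_{t_1}$, if and only if

$$c(\Delta^+ S) \ge H(X_S \mid X_{\mathcal{M}_{t_1} \setminus S}), \quad \forall S \subseteq \mathcal{M}_{t_1}, \tag{19}$$

where

$$c(\Delta^+ S) = \sum_{e \in \Delta^+ S} c_e, \quad \Delta^+ S \subseteq \mathcal{E}_{t_1}.$$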

Proof. As we discussed in Section III, the incoming links
to $t_1$ carry the entire information about the process. This, combined
with the fact that the goal is to minimize the communication
cost, implies that for any optimal MM-rate vector $R^*$ it holds
that

$$\sum_{e = (m_j, t_1) \in \mathcal{E}_{t_1}} R_e^* = H(X_{\mathcal{M}_{t_1}}). \tag{20}$$

Therefore, without loss of generality we can assume that the
capacities of the links incoming to $t_1$ satisfy

$$\sum_{e = (m_j, t_1) \in \mathcal{E}_{t_1}} c_e = H(X_{\mathcal{M}_{t_1}}), \tag{21}$$

provided that the feasible rate-flow region exists. It is not hard
to show that the capacity function $c(\Delta^+ S)$, $\forall S \subseteq \mathcal{M}_{t_1}$, is
submodular (see Chapter 2 in [10]). Let us denote by $\partial^*_{t_1}$ the
set of the boundaries $\partial R$ of a feasible rate-flow region:

$$\partial^*_{t_1} = \{\partial R : R_e \le c_e,\ \forall e \in \mathcal{E}_{t_1}\}. \tag{22}$$

In [11] it was shown that

$$\partial^*_{t_1} = B(c(\Delta^+)). \tag{23}$$

From (23) and (14) it follows that there exists a feasible
MM-rate vector iff

$$B(c(\Delta^+)) \cap B(g_{t_1}) \ne \emptyset. \tag{24}$$

Problem (24) is known as a common base problem [10], for
which a solution exists if and only if

$$c(\Delta^+ S) \ge g_{t_1}(S), \quad \forall S \subseteq \mathcal{M}_{t_1}. \tag{25}$$

This completes the proof of Lemma 1. ■

To verify whether there exists an achievable MM-rate vector
it is necessary to check whether all $2^{|\mathcal{M}_{t_1}|}$ inequalities in (19)
are satisfied. Verifying them one by one takes time exponential
in the number of nodes. However, due to the supermodularity
of the function $g_{t_1}$, the existence of a common base, and thus
the feasibility of the multisource multicast problem, can be
verified in polynomial time (see footnote 2, and [12] and [10], Chapter 4).
This algorithm also provides an achievable MM-rate vector
(given that it exists) that belongs to the rate-flow region $\partial\mathcal{R}_{t_1}$.
Extension of the result of Lemma 1 to the case with an
arbitrary number of clients is straightforward. We just need
to check if the inequalities (19) are satisfied for all clients in
$\mathcal{T}$.

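Theorem 1. For the multisource multicast problem over
$G(\mathcal{V}, \mathcal{E})$, with the capacity function $c$, there exists an
achievable MM-rate vector if and only if

$$c(\Delta^+ S) \ge H(X_S \mid X_{\mathcal{M}_{t_i} \setminus S}), \quad \forall S \subseteq \mathcal{M}_{t_i},\ \Delta^+ S \subseteq \mathcal{E}_{t_i},\ \forall t_i \in \mathcal{T}. \tag{26}$$

From [12], the common base problem, and hence the
feasibility of the multisource multicast problem, can be verified in
time that is linear in the number of clients $k$ and polynomial in $|\mathcal{E}|$.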
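To make the cut conditions in (26) concrete, the following sketch checks them by exhaustive enumeration on a small instance. It is not the polynomial time common base algorithm referred to in the text; it is an exponential-time illustration only. The edge list is a hypothetical topology chosen for the example (Figure 1 is not reproduced in the text), the side information is assumed to consist of raw packets, and all function names are illustrative.

```python
from itertools import combinations

# Hypothetical edge list (tail, head); an assumed topology for illustration.
EDGES = {
    "e1": ("m1", "m3"), "e2": ("m1", "m4"), "e3": ("m2", "m4"),
    "e4": ("m4", "m3"), "e5": ("m3", "t1"), "e6": ("m4", "t1"),
    "e7": ("m4", "t2"),
}
# Assumed raw-packet observations of the source nodes.
OBS = {"m1": {"a", "b"}, "m2": {"b", "c"}, "m3": {"c"}, "m4": {"d"}}


def H(nodes):
    """Joint entropy, in packets, of the raw packets held by `nodes`."""
    return len(set().union(*(OBS[v] for v in nodes)))


def sources_reaching(client, edges):
    """Source nodes that have a directed path to `client`."""
    found = {client}
    changed = True
    while changed:
        changed = False
        for tail, head in edges.values():
            if head in found and tail not in found:
                found.add(tail)
                changed = True
    return found - {client}


def is_feasible(edges, cap, clients):
    """Brute-force check of the cut conditions (26) for every client."""
    for t in clients:
        M_t = sources_reaching(t, edges)
        if H(M_t) != H(OBS):  # assumption (2): the client can reach the whole file
            return False
        # Links among the nodes of M_t and from M_t to the client t.
        E_t = {e: (u, v) for e, (u, v) in edges.items()
               if u in M_t and (v in M_t or v == t)}
        for r in range(1, len(M_t) + 1):
            for S in map(set, combinations(sorted(M_t), r)):
                out_cap = sum(cap[e] for e, (u, v) in E_t.items()
                              if u in S and v not in S)
                if out_cap < H(M_t) - H(M_t - S):  # H(X_S | X_{M_t \ S})
                    return False
    return True
```

With every capacity set to 4 packets, `is_feasible(EDGES, {e: 4 for e in EDGES}, ["t1", "t2"])` returns `True`; lowering the capacity of `e7` to 3 makes the instance infeasible, because that link is the only one entering `t2` in the assumed topology.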



V. Finding the Optimal MM-Rates w.r.t. the Linear Communication Cost

In this section we propose a polynomial time deterministic
algorithm that solves optimization problem (14). As in
Section IV, we begin by considering the single client case, i.e., when
$\mathcal{T} = \{t_1\}$. Then, by using a methodology similar to that in [3],
we extend our solution to an arbitrary number of clients.

A. Deterministic Algorithm for the Single Client Case 

When $\mathcal{T} = \{t_1\}$, the optimization problem (14) can
be written as

$$\min_{R} \sum_{e \in \mathcal{E}_{t_1}} \alpha_e R_e, \quad \text{s.t. } \partial R \in B(g_{t_1}),\ R_e \le c_e,\ \forall e \in \mathcal{E}_{t_1}, \tag{27}$$

where the supermodular set function $g_{t_1}$ is defined in (18).
Next, we introduce the dual set functions. For the function
$g_{t_1}$, its dual function $f_{t_1}$ can be obtained as follows:

$$f_{t_1}(S) = g_{t_1}(\mathcal{M}_{t_1}) - g_{t_1}(\mathcal{M}_{t_1} \setminus S), \quad \forall S \subseteq \mathcal{M}_{t_1}. \tag{28}$$

Applying formula (28), we obtain $f_{t_1}(S) = H(X_S)$, which is
a submodular function. Moreover, in [10] it was shown that
$B(g_{t_1}) = B(f_{t_1})$. Hence we can replace $B(g_{t_1})$ with $B(f_{t_1})$
in (27).
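The step from the dual-function formula to the joint entropy can be spelled out with the chain rule. The derivation below uses only the conditional-entropy definition of the supermodular function and the identity $H(X_A, X_B) = H(X_B) + H(X_A \mid X_B)$; the symbols $f_{t_1}$, $g_{t_1}$ and $\mathcal{M}_{t_1}$ follow the notation of this section.

```latex
\begin{aligned}
f_{t_1}(S)
  &= g_{t_1}(\mathcal{M}_{t_1}) - g_{t_1}(\mathcal{M}_{t_1} \setminus S) \\
  &= H(X_{\mathcal{M}_{t_1}}) - H\bigl(X_{\mathcal{M}_{t_1} \setminus S} \mid X_{S}\bigr) \\
  &= H(X_{S}) + H\bigl(X_{\mathcal{M}_{t_1} \setminus S} \mid X_{S}\bigr)
     - H\bigl(X_{\mathcal{M}_{t_1} \setminus S} \mid X_{S}\bigr) \\
  &= H(X_{S}).
\end{aligned}
```

The second line uses the fact that conditioning on the empty set leaves the joint entropy unchanged, and that the complement of $\mathcal{M}_{t_1} \setminus S$ within $\mathcal{M}_{t_1}$ is $S$.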

Optimization problem (27) has the form of the minimum cost
submodular flow problem (see [10] for formal definitions), but
with a few differences listed below.

1) In the submodular flow problem, the function $g_{t_1}$ has to be
defined over all vertices $\mathcal{V}_{t_1}$ of graph $G_{t_1}$. However, in
our case $g_{t_1}$ is a set function over the source vertices
only.

2) In the submodular flow problem, $g_{t_1}(\mathcal{V}_{t_1})$ must evaluate
to 0, whereas in our problem the function $g_{t_1}$ is not defined
for $\mathcal{V}_{t_1}$.

The first step of solving the problem (27) efficiently involves
verifying its feasibility. From the common base algorithm
we obtain an achievable MM-rate vector that belongs to
$B(f_{t_1})$, provided that $B(f_{t_1}) \ne \emptyset$. Given any achievable MM-rate
vector that belongs to $B(f_{t_1})$, one can construct the
auxiliary network over graph $G_{t_1}$ (see footnote 3). It can be verified that from
this step onwards we can apply the min-cost submodular flow
algorithm [10], which involves finding negative cycles of the
auxiliary network and updating the network accordingly, along
with the achievable MM-rate vector. A comparison between
different minimum cost submodular flow algorithms is provided
in [13].

B. Deterministic Algorithm for the Multiple Client Case 

In this section we extend the results from the previous
section to the case where the set $\mathcal{T}$ contains an arbitrary number
of clients. Motivated by the results from [3], the optimization
problem (14) can be written as follows:

$$\min_{Z} \sum_{e \in \mathcal{E}} \alpha_e Z_e, \tag{29}$$
$$\text{s.t. } Z_e \ge R_e^{(t_i)},\ \partial R^{(t_i)} \in \partial\mathcal{R}_{t_i},\ R_e^{(t_i)} \le c_e,\ \forall t_i \in \mathcal{T},\ \forall e \in \mathcal{E}_{t_i},$$

where $\partial\mathcal{R}_{t_i}$ is defined in (11) for $i = 1$. Equivalence between
the optimization problems (14) and (29) follows from the fact
that transmissions on graph $G$ have to be such that all clients
in $\mathcal{T}$ learn the file simultaneously.

Footnote 2: The complexity of the common base algorithm proposed in [12] is
polynomial in $|\mathcal{E}_{t_1}|$.

Footnote 3: See Chapter III of [10] for a detailed explanation.

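Optimization problem (29) has an exponential number of
constraints, which makes it challenging to solve in polynomial
time. To obtain a polynomial time solution we consider the
Lagrangian dual of problem (29):

$$\max_{\Lambda} \sum_{i=1}^{k} \phi^{(t_i)}(\Lambda^{(t_i)}), \tag{30}$$
$$\text{s.t. } \sum_{i=1}^{k} \Lambda_e^{(t_i)} = \alpha_e,\ \Lambda_e^{(t_i)} \ge 0,\ \forall t_i \in \mathcal{T},\ \forall e \in \mathcal{E}_{t_i},$$

where

$$\phi^{(t_i)}(\Lambda^{(t_i)}) = \min_{R^{(t_i)}} \sum_{e \in \mathcal{E}_{t_i}} \Lambda_e^{(t_i)} R_e^{(t_i)}, \tag{31}$$
$$\text{s.t. } \partial R^{(t_i)} \in \partial\mathcal{R}_{t_i},\ R_e^{(t_i)} \le c_e,\ \forall e \in \mathcal{E}_{t_i}.$$

For any given $t_i \in \mathcal{T}$, the objective function (31) of the dual
problem (30) can be computed in polynomial time, as pointed
out in Section V-A. Hence, we can apply a subgradient method
to solve the problem (30) in polynomial time.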

Let $R^{(t_i)}$ be the optimal rate tuple w.r.t. the problem (31)
for some weight vector $\Lambda^{(t_i)}$, $t_i \in \mathcal{T}$. Starting with a
feasible iterate $\Lambda[0]$ w.r.t. the optimization problem (30), every
subsequent iterate $\Lambda[n]$ can be recursively represented as the
Euclidean projection of the vector

$$\Lambda_e[n] = \Lambda_e[n-1] + \theta[n-1] \cdot R_e[n-1], \quad \forall e \in \mathcal{E}, \tag{32}$$

onto the set $\{\Lambda_e \ge 0 \mid \sum_{i=1}^{k} \Lambda_e^{(t_i)} = \alpha_e\}$, where
$R_e[n-1] = (R_e^{(t_i)}[n-1] : \forall t_i \in \mathcal{T})$. The Euclidean
projection ensures that every iterate $\Lambda[n]$ is feasible w.r.t. the
optimization problem (30). By appropriately choosing the step
size $\theta[n]$ in each iteration, it is guaranteed that the subgradient
method converges to the optimal solution of the problem (30).
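One subgradient step for a single link can be sketched as follows. The projection routine is the standard sort-based Euclidean projection onto a scaled simplex, which is an implementation choice made here and not something the text prescribes; the function names and the toy numbers are hypothetical. Each list holds one entry per client for a fixed link $e$.

```python
def project_onto_scaled_simplex(v, alpha):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = alpha}."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    if alpha == 0:
        return [0.0] * len(v)  # the feasible set is the single point 0
    u = sorted(v, reverse=True)
    running, tau = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        running += u_j
        candidate = (running - alpha) / j
        if u_j - candidate > 0:  # holds for j = 1, ..., rho only
            tau = candidate      # keep the threshold of the largest such j
    return [max(x - tau, 0.0) for x in v]


def dual_update(lam_e, rates_e, theta, alpha_e):
    """Move along the per-client rates of link e, then project back."""
    stepped = [l + theta * r for l, r in zip(lam_e, rates_e)]
    return project_onto_scaled_simplex(stepped, alpha_e)
```

For instance, `dual_update([0.5, 0.5], [1.0, 0.0], 0.5, 1.0)` shifts weight toward the first client, whose subproblem currently uses the link more heavily, while keeping the weights non-negative and summing to $\alpha_e = 1$.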

To recover the primal optimal solution from the iterates $\Lambda[n]$
we apply the results from [14], where at each iteration $n$ of
the subgradient method the primal iterate is constructed as follows:

$$\hat{R}[n] = \sum_{j=1}^{n} \mu_j^n R[j], \tag{33}$$

where

$$\sum_{j=1}^{n} \mu_j^n = 1, \quad \mu_j^n \ge 0, \quad \text{for } j = 1, 2, \ldots, n. \tag{34}$$

By carefully choosing the step size $\theta[n]$, $\forall n$, in (32) and the
convex combination coefficients $\mu_j^n$, $\forall j = 1, \ldots, n$, $\forall n$, it
is guaranteed that (33) converges to the minimizer of (14),
and therefore to the minimizer of the original problem (5).
In [14], the authors proposed several choices for $\{\theta[n]\}$ and
$\{\mu_j^n\}$ which lead to primal recovery. Here we list some
of them.

1) $\theta[n] = \frac{a}{b + cn}$, $\forall n$, where $a > 0$, $b \ge 0$, $c > 0$,
   $\mu_j^n = \frac{1}{n}$, $\forall j = 1, \ldots, n$, $\forall n$,

2) $\theta[n] = n^{-a}$, $\forall n$, where $0 < a < 1$,
   $\mu_j^n = \frac{1}{n}$, $\forall j = 1, \ldots, n$, $\forall n$.

It is only left to compute an optimal MM-rate vector w.r.t. the
linear objective defined in (14). Let $R^*$ and $Z^*$ be the optimal
rate vectors of the problems (14) and (29), respectively. As we
pointed out, $R^* = Z^*$, where $Z^*$ can be computed from $\hat{R}[n]$
for a sufficiently large $n$ as follows:

$$Z_e^* = \max\left\{\hat{R}_e^{(t_1)}[n], \hat{R}_e^{(t_2)}[n], \ldots, \hat{R}_e^{(t_k)}[n]\right\}, \quad \forall e \in \mathcal{E}.$$
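The averaging and per-link maximum can be sketched in a few lines. This is a minimal illustration under assumptions: uniform weights (each of the $n$ stored iterates weighted by $1/n$) are used, the data layout `history[j][client][link]` is hypothetical, a link absent from a client's subgraph is treated as carrying rate 0 for that client, and the two step-size helpers simply evaluate a harmonic rule $a/(b + cn)$ and a power rule $n^{-a}$.

```python
def theta_harmonic(n, a=1.0, b=0.0, c=1.0):
    """Step size a / (b + c * n), with a > 0, b >= 0, c > 0."""
    return a / (b + c * n)


def theta_power(n, a=0.5):
    """Step size n ** (-a), with 0 < a < 1."""
    return n ** (-a)


def recover_primal(history):
    """Average per-client rate iterates uniformly, then take per-link maxima.

    history[j][t][e] is the rate of link e in the subproblem of client t
    at iteration j + 1 (hypothetical layout).
    """
    n = len(history)
    clients = list(history[0])
    averaged = {
        t: {e: sum(it[t][e] for it in history) / n for e in history[0][t]}
        for t in clients
    }
    links = set().union(*(averaged[t] for t in clients))
    z = {e: max(averaged[t].get(e, 0.0) for t in clients) for e in links}
    return averaged, z
```

A short usage example: with two stored iterates in which client `t1` alternates between links `e1` and `e2` while client `t2` always sends 2 units on `e2`, the averaged rates of `t1` are 0.5 on each link and the recovered link rates are 0.5 on `e1` and 2.0 on `e2`.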

C. Code Construction for the Linear Source Model

In this section we briefly address the question of the optimal
code construction for the finite linear source model. We begin
our analysis by considering the following example.

Example 1. Consider a system with $k = 2$ clients and $l = 4$
source nodes presented in Figure 1. For convenience, we
express the data vector as $W = [a\ b\ c\ d]^T \in \mathbb{F}_{q^n}^4$,
where $a$, $b$, $c$, $d$ are independent uniform random variables
in $\mathbb{F}_{q^n}$. Each source node has the following observations:
$X_{m_1} = \{a, b\}$, $X_{m_2} = \{b, c\}$, $X_{m_3} = \{c\}$, $X_{m_4} = \{d\}$.
Let the objective function be $\sum_{e \in \mathcal{E}} R_e$, with the capacity
constraints $c_e = 4$, $\forall e \in \mathcal{E}$. Applying the algorithm described
in this section, we obtain

$$R_1 = R_4 = 0, \quad R_2 = R_5 = 1, \quad R_3 = 2, \quad R_6 = 3, \quad R_7 = 4.$$




Fig. 3. Multicast network construction for the multisource multicast problem.
We introduce a super node $S$ that possesses all the data packets, and transmits
them to the respective nodes.

Now, we briefly explain how to design the actual transmissions
of each source node. Starting from an optimal MM-rate
vector, we first construct the corresponding multicast network
by adding a super node $S$ that contains all individual packets
in $W$ (see Figure 3). Then, we apply the algebraic network
coding approach [8], where the source matrix $A$ is given by



$$A = \begin{bmatrix} A_{m_1} & \cdots & A_{m_l} & 0_{N \times |\mathcal{E}|} \end{bmatrix}. \tag{35}$$


In [8], the authors derived the transfer matrix $M(t_i)$ from
the super node $S$ to any receiver $t_i$, $i = 1, \ldots, k$. It is a
$|\mathcal{E}| \times |\mathcal{E}|$ matrix with the input vector $W$, and the output vector
corresponding to the observations at the receiver $t_i$:

$$M(t_i) = A (I - T)^{-1} B(t_i), \quad i = 1, \ldots, k, \tag{36}$$

where $T$ is the adjacency matrix of the multicast network, and
$B(t_i)$ is an output matrix. For more details on how these
matrices are constructed, we refer the interested reader to
reference [8]. Finally, given that $|\mathbb{F}_q| > k$, the network code
for the multisource multicast problem can be constructed in
polynomial time from the algorithms provided in [15], which
are based on a simultaneous transfer matrix completion.

VI. Conclusion 

In this work we study the linear cost multisource multicast 
problem, where each node in the network observes i.i.d. copies 
of the DMMS process. Assuming that nodes can communicate 
over the capacity constrained links of the directed acyclic 
graph, the goal is for each client (sink of the graph), to learn 
the file, while minimizing a linear communication cost. First, 
we show that the underlying optimization problem can be 
posed as a linear program with exponentially many rate-flow 
constraints. Then, we provide the "capacity flow" conditions 
under which the multisource multicast problem is feasible. 
Applying the common base algorithm one can construct a test 
that verifies feasibility in polynomial time. We show that the
linear cost multisource multicast problem with a single client
and many nodes can be solved in polynomial time by applying
algorithms for the minimum cost submodular flow problem.
Further, using the single client solution as a building block, we
show how one can solve the more general problem with an arbitrary
number of clients in polynomial time. For the special case of
the finite linear source model, we propose a polynomial time 
algorithm that computes an explicit transmission scheme. 